rmarkdown::paged_table(NFL2022_stuffs)
1 Question 2
1.1 Summary Statistics
skim(NFL2022_stuffs) %>%
select(-n_missing)| Name | NFL2022_stuffs |
| Number of rows | 50147 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|
| game_id | 1.00 | 13 | 15 | 0 | 284 | 0 |
| posteam | 0.93 | 2 | 3 | 0 | 32 | 0 |
Variable type: numeric
| skim_variable | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|
| play_id | 1.00 | 2057.86 | 1194.22 | 1 | 1039.00 | 2034.00 | 3065.50 | 5523 | ▇▇▇▅▁ |
| drive | 0.99 | 11.48 | 6.59 | 1 | 6.00 | 11.00 | 17.00 | 35 | ▇▇▇▂▁ |
| week | 1.00 | 9.91 | 5.61 | 1 | 5.00 | 10.00 | 15.00 | 22 | ▇▆▆▆▃ |
| qtr | 1.00 | 2.58 | 1.14 | 1 | 2.00 | 3.00 | 4.00 | 5 | ▆▇▆▇▁ |
| down | 0.83 | 2.00 | 1.00 | 1 | 1.00 | 2.00 | 3.00 | 4 | ▇▆▁▃▂ |
| half_seconds_remaining | 1.00 | 796.94 | 564.41 | 0 | 255.00 | 774.00 | 1285.00 | 1800 | ▇▅▅▅▅ |
| pass | 1.00 | 0.45 | 0.50 | 0 | 0.00 | 0.00 | 1.00 | 1 | ▇▁▁▁▆ |
| wp | 0.99 | 0.51 | 0.29 | 0 | 0.29 | 0.52 | 0.73 | 1 | ▆▆▇▆▆ |
1.2 Q2a
In data.frame, NFL2022_stuffs, remove observations for which values of posteam is missing
NFL2022_stuffs <- na.omit(NFL2022_stuffs[!is.na(NFL2022_stuffs$posteam), ])
rmarkdown::paged_table(NFL2022_stuffs)1.3 Q2b
Summarize the mean value of pass for each posteam when all of the following conditions hold: 1. wp is greater than 20% and less than 75% 2. down is less than or equal to 2 3. half_seconds_remaining is greater than 120
filtered_data <- NFL2022_stuffs[NFL2022_stuffs$wp > 0.20 & NFL2022_stuffs$wp < 0.75 &
NFL2022_stuffs$down <= 2 &
NFL2022_stuffs$half_seconds_remaining > 120, ]
mean_pass_by_posteam <- aggregate(filtered_data$pass, by=list(filtered_data$posteam), FUN=mean)
colnames(mean_pass_by_posteam) <- c("posteam", "mean_pass")
print(mean_pass_by_posteam) posteam mean_pass
1 ARI 0.5528455
2 ATL 0.4000000
3 BAL 0.5198330
4 BUF 0.6043956
5 CAR 0.4578947
6 CHI 0.4198312
7 CIN 0.6567460
8 CLE 0.4908722
9 DAL 0.4742647
10 DEN 0.4930796
11 DET 0.4906542
12 GB 0.5088496
13 HOU 0.4793388
14 IND 0.4938525
15 JAX 0.5207921
16 KC 0.6376068
17 LA 0.5104895
18 LAC 0.6076190
19 LV 0.4921569
20 MIA 0.5334646
21 MIN 0.5555556
22 NE 0.5208333
23 NO 0.4214464
24 NYG 0.5153846
25 NYJ 0.5061728
26 PHI 0.5801217
27 PIT 0.4796296
28 SEA 0.5662188
29 SF 0.4805726
30 TB 0.5529412
31 TEN 0.4342723
32 WAS 0.4054581
1.4 Q2c
Provide both (1) a ggplot code with geom_point() using the resulting data.frame in Q2b and (2) a simple comments to describe the mean value of pass for each posteam. In the ggplot, reorder the posteam categories based on the mean value of pass in ascending or in descending order
library(ggplot2)
mean_pass_by_posteam$posteam <- factor(mean_pass_by_posteam$posteam,
levels = mean_pass_by_posteam$posteam[order(mean_pass_by_posteam$mean_pass)])
ggplot(mean_pass_by_posteam, aes(x = mean_pass, y = posteam)) +
geom_point() +
labs(x = "Percent of Pass Plays", y = "Team with Possession", title = "Team vs Percent of Pass Plays") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
1.5 Q2d
Consider the following data.frame:
NFL2022_epa <- read_csv('https://bcdanl.github.io/data/NFL2022_epa.csv')
rmarkdown::paged_table(NFL2022_epa)Create the data.frame, NFL2022_stuffs_EPA, that includes:
- All the variables in the data.frame,
NFL2022_stuffs - The variables,
passer,receiver, andepa, from the data.frame,NFL2022_epaby joining the two data.frames
In the resulting data.frame, NFL2022_stuffs_EPA, remove observations with NA in passer
NFL2022_epa <- read_csv('https://bcdanl.github.io/data/NFL2022_epa.csv')
NFL2022_stuffs_EPA <- merge(NFL2022_stuffs, NFL2022_epa[, c("game_id", "passer", "receiver", "epa")], by = "game_id")
NFL2022_stuffs_EPA <- NFL2022_stuffs_EPA[complete.cases(NFL2022_stuffs_EPA$passer), ]
rmarkdown::paged_table(NFL2022_stuffs_EPA)1.6 Q2e
Provide both (1) a single ggplot and (2) a simple comment to describe the NFL weekly trend of weekly mean value of epa for each of the following two passers: 1. "J.Allen" 2. "P.Mahomes"
library(ggplot2)
selected_passers <- c("J.Allen", "P.Mahomes")
filtered_data <- NFL2022_stuffs_EPA[NFL2022_stuffs_EPA$passer %in% selected_passers, ]
filtered_data$week <- factor(filtered_data$week, levels = unique(filtered_data$week))
weekly_mean_epa <- aggregate(epa ~ week + passer, data = filtered_data, FUN = mean)
ggplot(weekly_mean_epa, aes(x = week, y = epa, color = passer, group = passer)) +
geom_line() +
labs(x = "Week", y = "Weekly Mean EPA", title = "NFL Weekly Trend of Mean EPA for J.Allen and P.Mahomes") +
theme_minimal()
Patrick Mahomes generally has a higher weekly mean epa than Josh Allen
1.7 Q2f
Calculate the difference between the mean value of epa for "J.Allen" the mean value of epa for "P.Mahomes" for each value of week.
selected_passers <- c("J.Allen", "P.Mahomes")
filtered_data <- NFL2022_stuffs_EPA[NFL2022_stuffs_EPA$passer %in% selected_passers, ]
mean_epa_by_week <- aggregate(epa ~ week + passer, data = filtered_data, FUN = mean)
epa_diff <- reshape(mean_epa_by_week, idvar = "week", timevar = "passer", direction = "wide")
epa_diff$epa_diff <- epa_diff$epa.J.Allen - epa_diff$epa.P.Mahomes
print(epa_diff) week epa.J.Allen epa.P.Mahomes epa_diff
1 1 0.52963415 0.69840404 -0.16876989
2 2 0.48691617 0.14841216 0.33850401
3 3 0.16932725 0.24559401 -0.07626677
4 4 0.19104682 0.27137549 -0.08032867
5 5 0.62742248 0.30228470 0.32513777
6 6 0.30652151 0.13313721 0.17338430
7 8 0.22419910 NA NA
8 9 -0.20799939 0.09646711 -0.30446651
9 10 0.16051785 0.58904325 -0.42852541
10 11 0.19206366 0.36503570 -0.17297205
11 12 0.09828258 0.24726968 -0.14898710
12 13 0.33021344 0.20622354 0.12398990
13 14 -0.06207961 0.13106472 -0.19314433
14 15 0.25693067 0.32195856 -0.06502788
15 16 0.02143551 0.12156763 -0.10013212
16 18 0.20865931 0.17297609 0.03568322
17 19 -0.20950326 NA NA
18 20 -0.04289048 0.27933023 -0.32222071
25 7 NA 0.70130690 NA
34 17 NA 0.19847047 NA
37 21 NA 0.19610416 NA
38 22 NA 0.55937371 NA
1.8 Q2g
Summarize the resulting data.frame in Q2d, with the following four variables:
posteam: String abbreviation for the team with possession.passer: Name of the player who passed a ball to a receiver by initially taking a three-step drop, and backpedaling into the pocket to make a pass. (Mostly, they are quarterbacks.)mean_epa: Mean value of epa in 2022 for each passern_pass: Number of observations for each passer
Then find the top 10 NFL passers in 2022 in terms of the mean value of epa, conditioning that n_pass must be greater than or equal to the third quantile level of n_pass
library(dplyr)
summary_data <- NFL2022_stuffs_EPA %>%
group_by(posteam, passer) %>%
summarise(mean_epa = mean(epa),
n_pass = n())
third_quantile <- quantile(summary_data$n_pass, 0.75)
top_passers <- summary_data %>%
filter(n_pass >= third_quantile) %>%
top_n(10, mean_epa) %>%
arrange(desc(mean_epa))
print(top_passers)# A tibble: 183 × 4
# Groups: posteam [32]
posteam passer mean_epa n_pass
<chr> <chr> <dbl> <int>
1 CHI K.Cousins 0.474 3676
2 CHI J.Goff 0.473 4036
3 SEA B.Purdy 0.466 4736
4 LA J.Garoppolo 0.458 4088
5 JAX P.Mahomes 0.456 5237
6 BUF J.Brissett 0.442 3744
7 CAR J.Goff 0.359 3900
8 IND T.Lawrence 0.326 4428
9 TEN T.Lawrence 0.290 6042
10 MIN J.Goff 0.288 6299
# ℹ 173 more rows